Sparse Principal Components Analysis
Authors
Abstract
Principal components analysis (PCA) is a classical method for the reduction of dimensionality of data in the form of n observations (or cases) of a vector with p variables. Contemporary data sets often have p comparable to, or even much larger than, n. Our main assertions, in such settings, are (a) that some initial reduction in dimensionality is desirable before applying any PCA-type search for principal modes, and (b) that this initial reduction in dimensionality is best achieved by working in a basis in which the signals have a sparse representation. We describe a simple asymptotic model in which the estimate of the leading principal component vector via standard PCA is consistent if and only if p(n)/n → 0. We provide a simple algorithm for selecting a subset of coordinates with largest sample variances, and show that if PCA is done on the selected subset, then consistency is recovered, even if p(n) ≫ n.

Our main setting is that of signals and images, in which the number of sampling points, or pixels, is often comparable with or larger than the number of cases, n. Our particular example here is the electrocardiogram (ECG) signal of the beating heart, but similar approaches have been used, say, for PCA on libraries of face images.

Standard PCA involves an O(min(n, p)^3) search for directions of maximum variance. But if we have some a priori way of selecting k ≪ min(n, p) coordinates in which most of the variation among cases is to be found, then the complexity of PCA is much reduced, to O(k^3). This is a computational reason; but if there is instrumental or other observational noise in each case that is uncorrelated with, or independent of, the relevant case-to-case variation, then there is another compelling reason to preselect a small subset of variables before running PCA. Indeed, we construct a model of factor analysis type and show that ordinary PCA can produce a consistent (as n → ∞) estimate of the principal factor if and only if p(n) is asymptotically of smaller order than n. Heuristically, if p(n) ≥ cn, there is so much observational noise, and so many dimensions over which to search, that a spurious noise maximum will always drown out the true factor.

Fortunately, it is often reasonable to expect such small subsets of variables to exist: much recent research in signal and image analysis has sought orthonormal bases and related systems in which typical signals have sparse representations, so that most coordinates carry only small signal energy. If such a basis is used to represent a signal (we use wavelets as the classical example here), then the variation in many coordinates is likely to be very small.

Consequently, we study a simple "sparse PCA" algorithm with the following ingredients:
a) given a suitable orthobasis, compute coefficients for each case;
b) compute sample variances (over cases) for each coordinate in the basis, and select the k coordinates of largest sample variance;
c) run standard PCA on the selected k coordinates, obtaining up to k estimated eigenvectors;
d) if desired, use soft or hard thresholding to denoise these estimated eigenvectors; and
e) re-express the (denoised) sparse PCA eigenvector estimates in the original signal domain.

We illustrate the algorithm on some exercise ECG data, and also develop theory to show, in a single-factor model and under an appropriate sparsity assumption, that it indeed overcomes the inconsistency problems when p(n) ≥ cn and yields consistent estimates of the principal factor.
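The following is a minimal sketch of the five ingredients above, not the authors' implementation. It assumes an orthonormal DCT basis (via SciPy) in place of the wavelet basis used for the ECG example, hard rather than soft thresholding, and illustrative names and parameters (sparse_pca, k, n_components, threshold).

```python
# Sketch of the sparse PCA recipe (steps a-e), under the assumptions stated above.
import numpy as np
from scipy.fft import dct, idct


def sparse_pca(X, k=50, n_components=1, threshold=None):
    """X: (n_cases, p) array of signals, one case per row."""
    # (a) coefficients of each case in an orthonormal basis (here: DCT)
    coeffs = dct(X, axis=1, norm="ortho")

    # (b) per-coordinate sample variances over cases; keep the k largest
    variances = coeffs.var(axis=0, ddof=1)
    selected = np.argsort(variances)[::-1][:k]

    # (c) standard PCA on the selected k coordinates
    sub = coeffs[:, selected]
    sub_centered = sub - sub.mean(axis=0)
    _, _, Vt = np.linalg.svd(sub_centered, full_matrices=False)
    eigvecs = Vt[:n_components]                # (n_components, k)

    # (d) optional hard thresholding to denoise the estimated eigenvectors
    if threshold is not None:
        eigvecs = np.where(np.abs(eigvecs) > threshold, eigvecs, 0.0)

    # (e) re-express the (denoised) estimates in the original signal domain
    full = np.zeros((n_components, X.shape[1]))
    full[:, selected] = eigvecs
    return idct(full, axis=1, norm="ortho")


# Synthetic usage example: n = 40 cases of a p = 256-sample signal
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 256)
factor = rng.normal(size=(40, 1))
X = factor * np.sin(2 * np.pi * 3 * t) + 0.5 * rng.normal(size=(40, 256))
rho_hat = sparse_pca(X, k=25, threshold=0.05)  # estimated principal eigenvector(s)
```

Replacing dct/idct with a wavelet transform (e.g., PyWavelets' wavedec/waverec) and using soft thresholding in step (d) would follow the paper's ECG setting more closely.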
Similar resources
Sparse Structured Principal Component Analysis and Model Learning for Classification and Quality Detection of Rice Grains
In scientific and commercial fields associated with modern agriculture, the categorization of different rice types and the determination of their quality are very important. Various image processing algorithms have been applied in recent years to detect different agricultural products. The problem of rice classification and quality detection in this paper is presented based on model learning concepts includ...
A New IRIS Segmentation Method Based on Sparse Representation
Iris recognition is one of the most reliable methods for identification. In general, it consists of image acquisition, iris segmentation, feature extraction and matching. Among them, iris segmentation has an important role in the performance of any iris recognition system. Nonlinear eye movement, occlusion, and specular reflection are the main challenges for any iris segmentation method. In thi...
Robust Sparse Principal Component Analysis
A method for principal component analysis is proposed that is sparse and robust at the same time. The sparsity delivers principal components that have loadings on a small number of variables, making them easier to interpret. The robustness makes the analysis resistant to outlying observations. The principal components correspond to directions that maximize a robust measure of the variance, with...
Sparse Principal Component Analysis
Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables; this often makes the results difficult to interpret. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified...
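For readers who want to experiment with sparse loadings of this general kind, the sketch below uses scikit-learn's SparsePCA estimator. Note the hedge: that estimator is an l1-penalized dictionary-learning formulation related to, but not identical with, the lasso/elastic-net SPCA described above, and the data here are synthetic.

```python
# Illustrative only: sparse loadings via scikit-learn's SparsePCA (a related
# l1-penalized formulation, not the exact elastic-net SPCA of the abstract).
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 1))
loadings = np.zeros((1, 30))
loadings[0, :5] = 1.0                      # only 5 of 30 variables are active
X = scores @ loadings + 0.1 * rng.normal(size=(200, 30))

spca = SparsePCA(n_components=1, alpha=1.0, random_state=0)
spca.fit(X)
print(spca.components_)                    # sparse loadings: most entries are exactly 0
```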
Robust Sparse 2D Principal Component Analysis for Object Recognition
In this paper we extensively investigate robust sparse two-dimensional principal component analysis (RS2DPCA), which makes the best of semantic and structural information and suppresses outliers. RS2DPCA combines the advantages of sparsity, the 2D data format and the L1-norm for data analysis. We also prove that RS2DPCA can offer a good solution for seeking sparse 2D principal components. To verify the pe...